incorporate boiler_id into epacamd_eia table #2558
note: this commit does not also update the alembic schema or remove the original epacamd_eia table from the db.
Codecov Report
Patch coverage:
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2558    +/-   ##
=======================================
- Coverage   88.4%   88.4%   -0.1%
=======================================
  Files         88      88
  Lines      10176   10176
=======================================
- Hits        9001    8999      -2
- Misses      1175    1177      +2
@cmgosnell some thoughts: Taking a closer look at the
When we augment the crosswalk with the generators_eia860 table we fill empty
I'm not sure this is the best idea, because many of these generators are renewables and thus don't have an associated smokestack as is implied by the existence of an
What is the impetus for equating
…ssions_unit_id_epa with generator id. Also remove the replacement of empty plant_id_epa values with plant_id_eia to avoid confusion about which plants are reporting to EPA and which are not. Also add boiler id to the bga table merge in augement_crosswalk_with_bga_eia860. Noticed that there is an issue with some of the BGA relationships getting dropped during the subplant id creation, but I'm not sure how to fix this yet.
To address the above comment, instead of filling the NA gaps in
BGA-to-subplant_id bug
I noticed that we merge the sub-plant id table with the bga table, but the m:m bga boiler-generator relationship is not retained after the sub-plant id creation. I'm not exactly sure why, but I think it has to do with how the network x process is currently calculating the sub-plant ids. I have a feeling there are some (hopefully) easy tweaks to fix this, but I'm not super familiar with network x. For example:
Vs.
hm good catch! i was wondering about this. it makes a degree of sense honestly that the boiler_id could be ignored/condensed. Conceptually it makes sense that the subplant id doesn't need to retain the detail about the boilers. from the generation side of things it should mostly care about the generators because that is largely what the epa linked to the smokestacks. So my first inclination would not be to go in and change the network_x process. (it might be good to drop the
I thiiiink this means that if we want the boilers we need a 1:m merge of the bga table after the fact.
since rn we have two versions of the epacamd_eia glue table it would be good to ponder whether or not we should make a specific
…e subplant_id table. Add code to merge the boiler_id back into the subplant_id table after the network x process because some boilers get dropped.
Restarted the migrations because I was getting the error
src/pudl/etl/glue_assets.py
Outdated
# the network x process uses unit_id_pudl and generator_id. During processing,
# the boiler_id data are truncated and we only retain unique values for generator_id
# and unit_id_pudl. This step adds the lost boiler_id info back into the table.
subplant_ids_updated = pd.merge(
    subplant_ids_updated.drop(columns="boiler_id").assign(
        unit_id_pudl=lambda x: x.unit_id_pudl.astype(
            "float"
        )  # necessary step for tests
    ),
    boiler_generator_assn_eia860[
        [
            "plant_id_eia",
            "generator_id",
            "boiler_id",
            "unit_id_pudl",
        ]
    ].drop_duplicates(),
    on=["plant_id_eia", "generator_id", "unit_id_pudl"],
    how="outer",
).drop_duplicates()
The reason that I added this chunk of code was because the network x process was dropping some of the boiler_ids (see comments above). This isn't pretty, but it remerges the subplant id table with the bga table so that we retain all the boiler-generator relationships.
There are two main reasons it's not ideal:
- we already have a function called augement_crosswalk_with_bga_eia860() so it's weird / duplicative that we are merging the bga table with the subplant id table for a second time. (Can't find a way around this for now without changing how the network X process is working, which we agreed wasn't a good idea for right now.)
- The pytest test/unit/transform/glue_test.py::test_epacamd_eia_subplant_ids test was failing without the line that converts the unit_id_pudl field to a float. This step makes the tests pass, but it causes the asset build to fail -.- grr: TypeError: Cannot interpret 'Int64Dtype()' as a data type. I don't know how to get around it other than adding this column dtype conversion step into the tests rather than here. But I think that would cause the same error in the tests. I would prefer to avoid this dtype issue if possible! It wasn't an issue before this merge was added, which I don't really understand.
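To make the re-merge concrete, here is a minimal sketch on toy data. The frames and the `subplant_id` values are invented for illustration, and the `drop_duplicates` call is only a stand-in for whatever the network x step actually does to collapse boiler detail; it is not the PUDL implementation.

```python
import pandas as pd

# Toy boiler-generator association: generator G1 is fed by two boilers.
bga = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1],
        "generator_id": ["G1", "G1", "G2"],
        "boiler_id": ["B1", "B2", "B3"],
        "unit_id_pudl": [1, 1, 2],
    }
)

# Stand-in (assumption) for the subplant step's output: one row per
# generator/unit, so only the first boiler per generator survives.
subplant_ids = bga.drop_duplicates(
    subset=["plant_id_eia", "generator_id", "unit_id_pudl"]
).assign(subplant_id=[0, 1])  # made-up ids

# Drop the partial boiler_id column, then merge the full association back in.
restored = subplant_ids.drop(columns="boiler_id").merge(
    bga[
        ["plant_id_eia", "generator_id", "boiler_id", "unit_id_pudl"]
    ].drop_duplicates(),
    on=["plant_id_eia", "generator_id", "unit_id_pudl"],
    how="outer",
)

print(restored)
```

In this toy case the collapsed table has two rows and the re-merged table has three, with boiler B2 restored alongside B1 for generator G1.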
on your second point, my first instinct is to add a convert_dtypes() step in the unit tests for all input dfs. hard agree that this convert to float is not ideal.
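A small sketch of what that suggestion could look like for one test input. The frame below is hypothetical, not the actual fixture in glue_test.py; it only shows that `DataFrame.convert_dtypes()` turns an integer-valued float column with a missing value into the nullable `Int64` dtype, so the test input would no longer need a cast to float.

```python
import pandas as pd

# Hypothetical test input: the missing value forces float64 on construction.
bga_test = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["G1", "G2", "G1"],
        "unit_id_pudl": [1.0, None, 2.0],
    }
)
print(bga_test["unit_id_pudl"].dtype)

# convert_dtypes() infers nullable extension dtypes, keeping the NA as <NA>.
converted = bga_test.convert_dtypes()
print(converted["unit_id_pudl"].dtype)
```

Whether this resolves the build-versus-test mismatch in the PR depends on the dtypes the real assets produce, which this sketch does not check.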
and on #1... maybe we should slightly generalize the augment_crosswalk_with_bga_eia860. it looks like i? or you? idk added boiler_id into that merge. but if the network_x step doesn't care about boilers/effectively drops boiler detail then they shouldn't be in the table pre-network_x anyway.
We could give augment_crosswalk_with_bga_eia860 an idx: list[str] arg that is either just ["plant_id_eia", "generator_id"] or ["plant_id_eia", "generator_id", "boiler_id"] and always boiler_generator_assn_eia860[idx + ["unit_id_pudl"]].drop_duplicates()
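A rough sketch of that idea, under stated assumptions: `augment_with_unit_id_pudl` is a hypothetical name, the toy frames are invented, and `validate="m:1"` assumes each `idx` key maps to a single `unit_id_pudl`, which may not hold in the real BGA table.

```python
import pandas as pd


def augment_with_unit_id_pudl(
    crosswalk: pd.DataFrame, bga: pd.DataFrame, idx: list[str]
) -> pd.DataFrame:
    """Hypothetical helper: attach unit_id_pudl from the BGA table on idx."""
    bga_slim = bga[idx + ["unit_id_pudl"]].drop_duplicates()
    # Assumption: after de-duplication each idx key appears once in bga_slim.
    return crosswalk.merge(bga_slim, on=idx, how="outer", validate="m:1")


bga = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1],
        "generator_id": ["G1", "G1", "G2"],
        "boiler_id": ["B1", "B2", "B3"],
        "unit_id_pudl": [1, 1, 2],
    }
)
crosswalk = pd.DataFrame(
    {
        "emissions_unit_id_epa": ["U1", "U2"],
        "plant_id_eia": [1, 1],
        "generator_id": ["G1", "G2"],
    }
)

# Generator-level merge: no boiler detail enters the table.
by_gen = augment_with_unit_id_pudl(crosswalk, bga, ["plant_id_eia", "generator_id"])

# Boiler-level merge: the crosswalk must already carry boiler_id.
crosswalk_boilers = crosswalk.merge(
    bga[["plant_id_eia", "generator_id", "boiler_id"]],
    on=["plant_id_eia", "generator_id"],
)
by_boiler = augment_with_unit_id_pudl(
    crosswalk_boilers, bga, ["plant_id_eia", "generator_id", "boiler_id"]
)
print(by_gen)
print(by_boiler)
```

With the generator-level `idx` the toy result has one row per generator; with the boiler-level `idx` it has one row per boiler-generator pair.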
I don't think this function is used more than once so we could also just call it augment_crosswalk_with_unit_id_pudl and remove boiler_id from the merge cols
I think it would be really confusing to have a version of the subplant table with all the boilers and a version with just some or none. My vote is for a subplant table with all the boilers.
Technically it shouldn't provide anything new. If it does, we probably did something wrong. I think now that erroneous filling of It probably makes sense then to change the names to something like
… fail in the epacamd_eia_subplant_ids function and add a convert_dtypes() function to the boiler_generator_assn_eia860_test table instead
…he metadata for the subplant_id table
I just removed the
…ut removing that table from the database
…Replace epacamd_eia validation test with epacamd_eia_subplant_id validation test
this all looks good! two hopefully small requests for tweaks. one here and one in comment land.
if we are removing the old epacamd_eia table (which i think is the right call!) can we rename the epacamd_eia_subplant_id -> epacamd_eia? it's just such a nicer name.
-        ["plant_id_eia", "generator_id", "unit_id_pudl"]
+        [
+            "plant_id_eia",
+            "generator_id",
+            "unit_id_pudl",
+            "boiler_id",
+        ]
     ].drop_duplicates(),
     how="outer",
-    on=["plant_id_eia", "generator_id"],
-    validate="m:1",
+    on=["plant_id_eia", "generator_id", "boiler_id"],
+    validate="m:m",
 )
since this merge is happening pre-network x, it seems like we should probably revert these changes no? so there is no boiler_id pre-network_x.
I don't think it really matters? Because we need to merge with the bga table for the unit_id which is used in the network x process. We could drop the boiler_id more explicitly before the process, but it wouldn't make a huge difference.
PR Overview
note: this pr does not YET also update the alembic schema or remove the original epacamd_eia table from the db.
i think there are two high-level steps needed:
- epacamd_eia_subplant_ids and epacamd_eia. i don’t think we need to keep them both as database tables if we are preserving the boiler id in the epacamd_eia_subplant_ids table. So I believe we could remove epacamd_eia as a db table.
- epacamd_eia -> clean_epacamd_eia & epacamd_eia_subplant_ids -> epacamd_eia.
metadata.resources.glue.RESOURCE_METADATA